Alex Arrieta, Rayan Gendre, Alvaro Ramos, Jackson Isidor
Code
library(tidyverse)library(here)library(knitr)library(gganimate)library(broom)# loading in our data setsgdp <-read_csv(here("data", "gdp_pcap.csv"))wateraccess <-read_csv(here("data", "at_least_basic_water_source_overall_access_percent.csv"))
1 Project Proposal
For this project, we are interested in assessing the relationship between a country’s GDP and the percentage of people within that country that have access to a basic water source. To do accomplish this, we will be using two datasets from Gapminder that correspond to our variables of interest.
1.1 The Data
Our two variables of interest (and their related datasets) are:
At least a basic water source, overall access %[Response]
This variable’s dataset displays the percent of people who are using at least basic water services by Country and by Year from 2000 to 2022. The percentage encompasses both the people who have access to basic water services as well as those who have access to safe water services. The dataset defines a basic water service as water from an improved source and collection is not more than 30 minutes for a round trip.
GDP per capita (Price and inflation adjusted)[Explanatory]
This variable’s dataset displays the Gross domestic product per person adjusted for differences in purchasing power, where each observation displays information for a Country in a given Year beginning in 1800 and going to present day. The values are displayed in international dollars and are fixed at 2017 prices.
1.2 Hypothesized Relationship
We hypothesize that there will be a positive relationship between the percentage of the country’s population that has access to water and the GDP per capita. This means that as the percentage of the population has access to water increases, then the GDP per capita of the country will also increase. Our basis for this is that countries with more prevalent global economies such as the United States and European countries have better access to water and safer modern ways for their citizens to access water. And when people have better access to water they can spend more time developing other areas of the country and business which leads to a more powerful economy and a better GDP per capita.
1.3 Data Cleaning
Because each variable has their own dataset, we will have to clean and combine the two datasets together to assess the relationship between the two variables. Starting with the GDP per capita dataset, some values have the letter k embedded alongside the numeric value, preventing us to pivot the dataset into a longer format. To get around this, we forced every column in our dataset to become character variables so every column is the same data type. Once our data has been pivoted, we could then address the issue of embedded k’s by extracting the numbers for values that did contain a k, and then multiplying these values by 1000 while also converting these values to a numeric data type. Afterwards, we converted the remaining variables (country and year) to an appropriate data type.
The process for cleaning the water access dataset was much easier than for the GDP per capita dataset, as we did not have to deal with the issue of embedded k’s. Therefore, all we had to do for this dataset was pivot and then convert the remaining columns/variables to an appropriate data type. Additionally, we dropped any observtions that contained any missing or NA values.
Once each dataset had been pivoted and cleaned, we could then join our two datasets through an inner join by country and by year. To help with visualization, we also created a new variable called region which indicates the world region that a country resides in.
Code
# Pivoting and cleaning the gdp datasetgdp_clean <- gdp |>mutate(across(.cols =everything(),.f =~as.character(.x) ) ) |>pivot_longer(cols =`1800`:`2100`, names_to ="year", values_to ="gdp_pc" ) |>mutate(gdp_pc =if_else(condition =str_detect(gdp_pc, "k$"),true =as.numeric(str_extract(gdp_pc, "[0-9|.]*")) *1000,false =as.numeric(gdp_pc) ),country =as.factor(country),year =as.numeric(year) )# Pivoting and cleaning the water_access datasetwateraccess_clean <- wateraccess |>pivot_longer(cols =`2000`:`2022`,names_to ="year",values_to ="access_pct",values_drop_na =TRUE ) |>mutate(country =as.factor(country),year =as.numeric(year),access_pct =as.numeric(access_pct) )# Joining the two datasetsgdp_v_water <- gdp_clean |>inner_join(wateraccess_clean,by =join_by(country, year) ) |>mutate(region =fct_collapse(country,"Asia"=c("Afghanistan","Australia","Bahrain","Bangladesh","Bhutan","Brunei","Cambodia","China","Hong Kong, China","Fiji","India","Indonesia","Iran","Iraq","Israel","Japan","Jordan","Kazakhstan","Kiribati","Kuwait","Kyrgyz Republic","Lao","Lebanon","Malaysia","Maldives","Marshall Islands","Micronesia, Fed. Sts.","Mongolia","Myanmar","Nauru","Nepal","New Zealand","North Korea","Oman","Pakistan","Palestine","Palau","Papua New Guinea","Philippines","Qatar","Samoa","Saudi Arabia","Singapore","Solomon Islands","South Korea","Sri Lanka","Syria","Tajikistan","Thailand","Timor-Leste","Tonga","Turkmenistan","Tuvalu","UAE","Uzbekistan","Vanuatu","Vietnam","Yemen"),"Africa"=c("Algeria","Angola","Benin","Botswana","Burkina Faso","Burundi","Cameroon","Cape Verde","Central African Republic","Chad","Comoros","Congo, Dem. Rep.","Congo, Rep.","Cote d'Ivoire","Djibouti","Egypt","Equatorial Guinea","Eritrea","Eswatini","Ethiopia","Gabon","Gambia","Ghana","Guinea","Guinea-Bissau","Kenya","Lesotho","Liberia","Libya","Madagascar","Malawi","Mali","Mauritania","Mauritius","Morocco","Mozambique","Namibia","Niger","Nigeria","Palau","Rwanda","Sao Tome and Principe","Senegal","Seychelles","Sierra Leone","Somalia","South Africa","South Sudan","Sudan","Tanzania","Togo","Tunisia","Uganda","Zambia","Zimbabwe"),"Americas"=c("Antigua and Barbuda","Argentina","Bahamas","Barbados","Belize","Bolivia","Brazil","Canada","Chile","Colombia","Costa Rica","Cuba","Dominica","Dominican Republic","Ecuador","El Salvador","Grenada","Guatemala","Guyana","Haiti","Honduras","Jamaica","Mexico","Nicaragua","Panama","Paraguay","Peru","St. Kitts and Nevis","St. Lucia","St. Vincent and the Grenadines","Suriname","Trinidad and Tobago","USA","Uruguay","Venezuela"),"Europe"=c("Albania","Andorra","Armenia","Austria","Azerbaijan","Belarus","Belgium","Bosnia and Herzegovina","Bulgaria","Croatia","Cyprus","Czech Republic","Denmark","Estonia","Finland","France","Georgia","Germany","Greece","Hungary","Iceland","Ireland","Italy","Latvia","Lithuania","Luxembourg","Malta","Moldova","Monaco","Montenegro","Netherlands","North Macedonia","Norway","Poland","Portugal","Romania","Russia","San Marino","Serbia","Slovak Republic","Slovenia","Spain","Sweden","Switzerland","Turkey","UK","Ukraine"),other_level ="other" ),.after = country )kable(head(gdp_v_water), format ="markdown")
country
region
year
gdp_pc
access_pct
Afghanistan
Asia
2000
794
27.4
Afghanistan
Asia
2001
775
27.5
Afghanistan
Asia
2002
1260
29.7
Afghanistan
Asia
2003
1280
31.9
Afghanistan
Asia
2004
1260
34.1
Afghanistan
Asia
2005
1350
36.3
2 Assessing the Linear Relationship
Now that we have our combined dataset, we can now assess the relationship between a country’s GDP per capita and the proportion of their population that has access to water. Before formally analyzing this relationship through a fitted linear regression model, we would like to look at the relationship between the two variables over time.
2.1 Data Visualization
Code
gdp_v_water |>ggplot(aes(x = gdp_pc,y = access_pct,fill = region,color = region,shape = region) ) +geom_point() +theme_bw() +scale_color_manual(values =c("#e66101","#fdb863","#b2abd2","#5e3c99") )+scale_fill_manual(values =c("#e66101","#fdb863","#b2abd2","#5e3c99") ) +scale_shape_manual(values =21:24) +labs(x ="GDP Per Capita",y ="",subtitle ="% of Pop. with Basic Water Access",title ="Relationship between Water Access and GDP",tag ="Year: {frame_time}",fill ="World Region",color ="World Region",shape ="World Region" ) +transition_time(as.integer(year)) +ease_aes('linear')
When assessing the relationship between Water Access vs. GDP over time through an animated plot, we can see that there generally seems to be a strong, positive non-linear relationship between the two variables; while countries with larger GDP per capita tend to have higher rates of water access, this trend only seems to apply when looking at countries that have a GDP per Capita of about $50,000 or less. Countries with a GDP per capita of $50,000 or greater will more often than not have almost 100% of their population have access to a basic water source. Additionally, as time goes on, we see that there seems to be an improvement overall in the percentage of the population that have water access for all countries, with most countries having been able to increase this percentage over time. One other notable observation that can be made from this animated plot is that the world region with seemingly the highest percentage of water access on average is Europe, while the region with the lowest percentage on average seems to be Africa (which makes sense as European countries seem to have the highest GDP per capita while African countries seem to have the lowest).
Because we can see that the relationship between GDP per capita and Water Access is far from being linear, we would like to perform a logit transformation on the average proportion of a country’s population that has basic water access (our response variable) before fitting a linear model to help with this issue of non-linearity using the following formula:
ln(average water access / (101 - average water access))
A plot containing a fitted linear model alongside our transformed response variable follows.
Code
gdp_v_water_mean <- gdp_v_water |>group_by(country, region) |>summarize(mean_gdp =mean(gdp_pc),mean_access =mean(access_pct) ) |>mutate(logit_meanaccess =log((mean_access)/(101- mean_access)))gdp_v_water_mean |>ggplot(aes(x = mean_gdp,y = logit_meanaccess ) ) +geom_point() +geom_smooth(method ="lm",aes(x = mean_gdp,y = logit_meanaccess,group =1),color ="red" ) +theme_bw() +labs(x ="GDP Per Capita (averaged over the years)",y ="",subtitle ="Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",title ="Relationship between LOGIT avg. Water Access and avg. GDP" ) +scale_y_continuous(limits =c(0, 6))
From this plot, we can see that even with a transformation, there still seems to be a somewhat weak, positive linear relationship between the average GDP per capita of a country and the logit of the population percentage with access to a basic water source, though this relationship is much improved in terms of linearity over the relationship with the original non transformed response. Additionally, we can also see that there may be some data points in which some people may consider outliers. Most notably, there is one country that has a relatively high GDP per capita compared to the other countries. By rearranging our combined dataset by the mean GDP per capita in descending order and grabbing the first observation, we can determine that this country with the highest average GDP per capita is Monaco, a country in Europe with an average GDP per capita of $186,521.74 and a logit value of 4.6052.
Our estimated linear regression model (after performing a logit transformation on our response variable) is:
ŷ = predicted ln(mean water access/(101- mean water access)) = 1.516 + 0.0000463(mean GDP per capita).
Since the intercept for our model is 1.516, then the logit value for countries with an average GDP per capita of zero is 1.516 (corresponding to a population percentage value of 82.83, which was obtained by undoing the original logit transformation through an inverse). Similarly, because our model’s slope is 0.0000463, then when a country’s GDP per capita increases by one dollar, the logit value of the population percentage that has access to a basic water supply is expected to increase by 0.0000463.
To assess our model’s fit, we can use variances for the response, fitted, and residual values. From here, we can calculate the proportion of the variability seen in our response variable that can be explained by our estimated linear regression model, which in this case is:
1.156/2.455 = 1-(1.300/2.455) = 0.471
Only about 47.1% of the variation in the response can be accounted for by our model, which indicates that we have a very poor model, and suggests that there is a very weak linear relationship between the GDP per capita of a country and the logit value of the percentage of the country’s population that has access to a basic water source. Therefore, a linear model may not be the best way to describe the relationship between these two variables, which confirms what we saw earlier when plotting the linear regression model.
3 Simulation
To better assess how our fitted linear model performs at explaining the total variability seen in the response variable, we can do some simulation!
3.1 Visualizating Simulations from the Model
To start off, we can generate simulated observations by using the predicted values from our linear regression model and add random errors to each one using the residual standard error from the same model.
Code
set.seed(7256)# Creating function that adds random noise to simulate 'observed' valuesaddnoise <-function(x, mean =0, sd) { new_x <- x +rnorm(length(x), mean, sd )return(new_x)}# Extracting predicted values and standard deviation of residuals from fitted modellogit_predict <-predict(model)resid_sigma <-sigma(model)# Testing function and creating a set of simulated observationssim_logit <-tibble(sim_logit_meanaccess =addnoise(x = logit_predict, sd = resid_sigma))sim_logit
# Adding newly generated observations to a dataset containing the original observed valuessim_data <- gdp_v_water_mean |>ungroup() |>filter(is.na(logit_meanaccess) ==FALSE,is.na(mean_access) ==FALSE ) |>select(mean_gdp, logit_meanaccess) |>bind_cols(sim_logit)# Visualizing and comparing the simulation from our model to the original datasim_data |>ggplot(aes(x = mean_gdp,y = logit_meanaccess ) ) +geom_point() +geom_smooth(method ="lm",aes(x = mean_gdp,y = logit_meanaccess,group =1),color ="red" ) +theme_bw() +labs(x ="GDP Per Capita (averaged over the years)",y ="",subtitle ="Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",title ="Relationship between Observed LOGIT avg. Water Access and avg. GDP" ) +scale_y_continuous(limits =c(0, 6))
Code
sim_data |>ggplot(aes(x = mean_gdp,y = sim_logit_meanaccess ) ) +geom_point() +geom_smooth(method ="lm",aes(x = mean_gdp,y = logit_meanaccess,group =1),color ="red" ) +theme_bw() +labs(x ="GDP Per Capita (averaged over the years)",y ="",subtitle ="Simulated Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",title ="Relationship between Simulated LOGIT avg. Water Access and avg. GDP" ) +scale_y_continuous(limits =c(0, 6))
Code
sim_data |>ggplot(aes(x = sim_logit_meanaccess,y = logit_meanaccess), ) +geom_point() +scale_x_continuous(limits =c(0, 10)) +scale_y_continuous(limits =c(0, 10)) +geom_abline(slope =1, intercept =0, color ="red", linewidth =1) +theme_bw() +labs(x ="Simulated Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",y ="",subtitle ="Observed Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",title ="Relationship between Simulated LOGIT avg. Water Access and Observed LOGIT avg. Water Access" )
3.2 Generating Multiple Predictive Checks
Instead of creating just one dataset of simulated observations, we can generate 1000 simulated datasets. From here, we can regress the observed dataset against each one of the 1000 simulated datasets to generate multiple R^2 values, from where we can assess the distribution of the R^2 values to determine how well our original linear regression model describes the dataset (by determining how much variability seen in the observed logit water access is accounted for by our similated data based on the linear model).
Code
set.seed(7256)# Generating 1000 simulated sets of observations and putting them into one datasetnsims <-1000sim_data2 <-map_dfc(.x =1:nsims,.f =~tibble(sim =addnoise(logit_predict,sd = resid_sigma ) ) )# Cleaning the simulated datasetcolnames(sim_data2) <-colnames(sim_data2) |>str_replace(pattern ="\\.\\.\\.",replace ="_" )# Combining this simulated dataset with a dataset containing the original observed valuessim_data2 <- gdp_v_water_mean |>ungroup() |>filter(is.na(mean_gdp) ==FALSE,is.na(logit_meanaccess) ==FALSE ) |>select(logit_meanaccess) |>bind_cols(sim_data2)head(sim_data)
# Regressing observed values to each set of simulated values to generate r-squared valuesreg_sim <-map(.x = sim_data2,.f =~lm(logit_meanaccess ~ .x,data = sim_data2) ) |>map(.f =~glance(.x)) |>map_dbl(.f =~ .x$r.squared)# Removing the r-squared value corresponding to logit_meanaccess ~ logitmeanaccessreg_sim <- reg_sim[names(reg_sim) !="logit_meanaccess"]reg_sim
---title: "Final Project Written Report"author: "Alex Arrieta, Rayan Gendre, Alvaro Ramos, Jackson Isidor"format: html: embed-resources: true self-contained: true code-tools: true toc: true code-fold: true theme: flatly number-sections: trueeditor: sourceexecute: error: true echo: true message: false warning: false---```{r setup}library(tidyverse)library(here)library(knitr)library(gganimate)library(broom)# loading in our data setsgdp <- read_csv(here("data", "gdp_pcap.csv"))wateraccess <- read_csv(here("data", "at_least_basic_water_source_overall_access_percent.csv"))```## Project ProposalFor this project, we are interested in assessing the relationship between a country's GDP and the percentage of people within that country that have access to a basic water source. To do accomplish this, we will be using two datasets from Gapminder that correspond to our variables of interest.### The DataOur two variables of interest (and their related datasets) are:**At least a basic water source, overall access %** ***[Response]***This variable's dataset displays the percent of people who are using at least basic water services by Country and by Year from 2000 to 2022. The percentage encompasses both the people who have access to basic water services as well as those who have access to safe water services. The dataset defines a basic water service as water from an improved source and collection is not more than 30 minutes for a round trip.**GDP per capita (Price and inflation adjusted)** ***[Explanatory]***This variable's dataset displays the Gross domestic product per person adjusted for differences in purchasing power, where each observation displays information for a Country in a given Year beginning in 1800 and going to present day. The values are displayed in international dollars and are fixed at 2017 prices.### Hypothesized RelationshipWe hypothesize that there will be a positive relationship between the percentage of the country's population that has access to water and the GDP per capita. This means that as the percentage of the population has access to water increases, then the GDP per capita of the country will also increase. Our basis for this is that countries with more prevalent global economies such as the United States and European countries have better access to water and safer modern ways for their citizens to access water. And when people have better access to water they can spend more time developing other areas of the country and business which leads to a more powerful economy and a better GDP per capita.### Data CleaningBecause each variable has their own dataset, we will have to clean and combine the two datasets together to assess the relationship between the two variables. Starting with the GDP per capita dataset, some values have the letter `k` embedded alongside the numeric value, preventing us to pivot the dataset into a longer format. To get around this, we forced every column in our dataset to become character variables so every column is the same data type. Once our data has been pivoted, we could then address the issue of embedded `k`'s by extracting the numbers for values that did contain a `k`, and then multiplying these values by 1000 while also converting these values to a numeric data type. Afterwards, we converted the remaining variables (country and year) to an appropriate data type.The process for cleaning the water access dataset was much easier than for the GDP per capita dataset, as we did not have to deal with the issue of embedded `k`'s. Therefore, all we had to do for this dataset was pivot and then convert the remaining columns/variables to an appropriate data type. Additionally, we dropped any observtions that contained any missing or NA values.Once each dataset had been pivoted and cleaned, we could then join our two datasets through an inner join by country and by year. To help with visualization, we also created a new variable called `region` which indicates the world region that a country resides in.```{r}# Pivoting and cleaning the gdp datasetgdp_clean <- gdp |>mutate(across(.cols =everything(),.f =~as.character(.x) ) ) |>pivot_longer(cols =`1800`:`2100`, names_to ="year", values_to ="gdp_pc" ) |>mutate(gdp_pc =if_else(condition =str_detect(gdp_pc, "k$"),true =as.numeric(str_extract(gdp_pc, "[0-9|.]*")) *1000,false =as.numeric(gdp_pc) ),country =as.factor(country),year =as.numeric(year) )# Pivoting and cleaning the water_access datasetwateraccess_clean <- wateraccess |>pivot_longer(cols =`2000`:`2022`,names_to ="year",values_to ="access_pct",values_drop_na =TRUE ) |>mutate(country =as.factor(country),year =as.numeric(year),access_pct =as.numeric(access_pct) )# Joining the two datasetsgdp_v_water <- gdp_clean |>inner_join(wateraccess_clean,by =join_by(country, year) ) |>mutate(region =fct_collapse(country,"Asia"=c("Afghanistan","Australia","Bahrain","Bangladesh","Bhutan","Brunei","Cambodia","China","Hong Kong, China","Fiji","India","Indonesia","Iran","Iraq","Israel","Japan","Jordan","Kazakhstan","Kiribati","Kuwait","Kyrgyz Republic","Lao","Lebanon","Malaysia","Maldives","Marshall Islands","Micronesia, Fed. Sts.","Mongolia","Myanmar","Nauru","Nepal","New Zealand","North Korea","Oman","Pakistan","Palestine","Palau","Papua New Guinea","Philippines","Qatar","Samoa","Saudi Arabia","Singapore","Solomon Islands","South Korea","Sri Lanka","Syria","Tajikistan","Thailand","Timor-Leste","Tonga","Turkmenistan","Tuvalu","UAE","Uzbekistan","Vanuatu","Vietnam","Yemen"),"Africa"=c("Algeria","Angola","Benin","Botswana","Burkina Faso","Burundi","Cameroon","Cape Verde","Central African Republic","Chad","Comoros","Congo, Dem. Rep.","Congo, Rep.","Cote d'Ivoire","Djibouti","Egypt","Equatorial Guinea","Eritrea","Eswatini","Ethiopia","Gabon","Gambia","Ghana","Guinea","Guinea-Bissau","Kenya","Lesotho","Liberia","Libya","Madagascar","Malawi","Mali","Mauritania","Mauritius","Morocco","Mozambique","Namibia","Niger","Nigeria","Palau","Rwanda","Sao Tome and Principe","Senegal","Seychelles","Sierra Leone","Somalia","South Africa","South Sudan","Sudan","Tanzania","Togo","Tunisia","Uganda","Zambia","Zimbabwe"),"Americas"=c("Antigua and Barbuda","Argentina","Bahamas","Barbados","Belize","Bolivia","Brazil","Canada","Chile","Colombia","Costa Rica","Cuba","Dominica","Dominican Republic","Ecuador","El Salvador","Grenada","Guatemala","Guyana","Haiti","Honduras","Jamaica","Mexico","Nicaragua","Panama","Paraguay","Peru","St. Kitts and Nevis","St. Lucia","St. Vincent and the Grenadines","Suriname","Trinidad and Tobago","USA","Uruguay","Venezuela"),"Europe"=c("Albania","Andorra","Armenia","Austria","Azerbaijan","Belarus","Belgium","Bosnia and Herzegovina","Bulgaria","Croatia","Cyprus","Czech Republic","Denmark","Estonia","Finland","France","Georgia","Germany","Greece","Hungary","Iceland","Ireland","Italy","Latvia","Lithuania","Luxembourg","Malta","Moldova","Monaco","Montenegro","Netherlands","North Macedonia","Norway","Poland","Portugal","Romania","Russia","San Marino","Serbia","Slovak Republic","Slovenia","Spain","Sweden","Switzerland","Turkey","UK","Ukraine"),other_level ="other" ),.after = country )kable(head(gdp_v_water), format ="markdown")```## Assessing the Linear RelationshipNow that we have our combined dataset, we can now assess the relationship between a country's GDP per capita and the proportion of their population that has access to water. Before formally analyzing this relationship through a fitted linear regression model, we would like to look at the relationship between the two variables over time.### Data Visualization```{r}gdp_v_water |>ggplot(aes(x = gdp_pc,y = access_pct,fill = region,color = region,shape = region) ) +geom_point() +theme_bw() +scale_color_manual(values =c("#e66101","#fdb863","#b2abd2","#5e3c99") )+scale_fill_manual(values =c("#e66101","#fdb863","#b2abd2","#5e3c99") ) +scale_shape_manual(values =21:24) +labs(x ="GDP Per Capita",y ="",subtitle ="% of Pop. with Basic Water Access",title ="Relationship between Water Access and GDP",tag ="Year: {frame_time}",fill ="World Region",color ="World Region",shape ="World Region" ) +transition_time(as.integer(year)) +ease_aes('linear')```When assessing the relationship between Water Access vs. GDP over time through an animated plot, we can see that there generally seems to be a strong, positive **non-linear** relationship between the two variables; while countries with larger GDP per capita tend to have higher rates of water access, this trend only seems to apply when looking at countries that have a GDP per Capita of about $50,000 or less. Countries with a GDP per capita of $50,000 or greater will more often than not have almost 100% of their population have access to a basic water source. Additionally, as time goes on, we see that there seems to be an improvement overall in the percentage of the population that have water access for all countries, with most countries having been able to increase this percentage over time.One other notable observation that can be made from this animated plot is that the world region with seemingly the highest percentage of water access on average is Europe, while the region with the lowest percentage on average seems to be Africa (which makes sense as European countries seem to have the highest GDP per capita while African countries seem to have the lowest).Because we can see that the relationship between GDP per capita and Water Access is far from being linear, we would like to perform a logit transformation on the average proportion of a country's population that has basic water access (our response variable) before fitting a linear model to help with this issue of non-linearity using the following formula:ln(*average water access* / (101 - *average water access*))A plot containing a fitted linear model alongside our transformed response variable follows.```{r}gdp_v_water_mean <- gdp_v_water |>group_by(country, region) |>summarize(mean_gdp =mean(gdp_pc),mean_access =mean(access_pct) ) |>mutate(logit_meanaccess =log((mean_access)/(101- mean_access)))gdp_v_water_mean |>ggplot(aes(x = mean_gdp,y = logit_meanaccess ) ) +geom_point() +geom_smooth(method ="lm",aes(x = mean_gdp,y = logit_meanaccess,group =1),color ="red" ) +theme_bw() +labs(x ="GDP Per Capita (averaged over the years)",y ="",subtitle ="Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",title ="Relationship between LOGIT avg. Water Access and avg. GDP" ) +scale_y_continuous(limits =c(0, 6))```From this plot, we can see that even with a transformation, there still seems to be a somewhat weak, positive linear relationship between the average GDP per capita of a country and the logit of the population percentage with access to a basic water source, though this relationship is much improved in terms of linearity over the relationship with the original non transformed response. Additionally, we can also see that there may be some data points in which some people may consider outliers. Most notably, there is one country that has a relatively high GDP per capita compared to the other countries. By rearranging our combined dataset by the mean GDP per capita in descending order and grabbing the first observation, we can determine that this country with the highest average GDP per capita is Monaco, a country in Europe with an average GDP per capita of $186,521.74 and a logit value of 4.6052.```{r}#| output: falsegdp_v_water_mean |>ungroup() |>slice_max(order_by = mean_gdp,n =1)```### Linear RegressionIn addition to our plot containing a linear regression model, we can also create and obtain the coefficients for this model.```{r}#| output: truemodel <-lm(logit_meanaccess ~ mean_gdp, data = gdp_v_water_mean)model#summary(model)#augment(model)#anova(model)```Our estimated linear regression model (after performing a logit transformation on our response variable) is: ŷ = predicted ln(*mean water access*/(101- *mean water access*)) = 1.516 + 0.0000463(*mean GDP per capita*). Since the intercept for our model is 1.516, then the logit value for countries with an average GDP per capita of zero is 1.516 (corresponding to a population percentage value of `r round(101*(1/(1+2.72^(-1.516))), 2)`, which was obtained by undoing the original logit transformation through an inverse). Similarly, because our model's slope is 0.0000463, then when a country's GDP per capita increases by one dollar, the logit value of the population percentage that has access to a basic water supply is expected to increase by 0.0000463.### Model Fit(INCLUDE ASSESSMENT OF LINE CONDITIONS HERE)```{r}kable(tibble(LogitAccess_Variance =var(augment(model)$logit_meanaccess),Prediction_Variance =var(augment(model)$.fitted),Residual_Variance =var(augment(model)$.resid) ))```To assess our model's fit, we can use variances for the response, fitted, and residual values. From here, we can calculate the proportion of the variability seen in our response variable that can be explained by our estimated linear regression model, which in this case is:1.156/2.455 = 1-(1.300/2.455) = 0.471Only about 47.1% of the variation in the response can be accounted for by our model, which indicates that we have a very poor model, and suggests that there is a very weak linear relationship between the GDP per capita of a country and the logit value of the percentage of the country's population that has access to a basic water source. Therefore, a linear model may not be the best way to describe the relationship between these two variables, which confirms what we saw earlier when plotting the linear regression model.## SimulationTo better assess how our fitted linear model performs at explaining the total variability seen in the response variable, we can do some simulation!### Visualizating Simulations from the ModelTo start off, we can generate simulated observations by using the predicted values from our linear regression model and add random errors to each one using the residual standard error from the same model.```{r}set.seed(7256)# Creating function that adds random noise to simulate 'observed' valuesaddnoise <-function(x, mean =0, sd) { new_x <- x +rnorm(length(x), mean, sd )return(new_x)}# Extracting predicted values and standard deviation of residuals from fitted modellogit_predict <-predict(model)resid_sigma <-sigma(model)# Testing function and creating a set of simulated observationssim_logit <-tibble(sim_logit_meanaccess =addnoise(x = logit_predict, sd = resid_sigma))sim_logit# Adding newly generated observations to a dataset containing the original observed valuessim_data <- gdp_v_water_mean |>ungroup() |>filter(is.na(logit_meanaccess) ==FALSE,is.na(mean_access) ==FALSE ) |>select(mean_gdp, logit_meanaccess) |>bind_cols(sim_logit)# Visualizing and comparing the simulation from our model to the original datasim_data |>ggplot(aes(x = mean_gdp,y = logit_meanaccess ) ) +geom_point() +geom_smooth(method ="lm",aes(x = mean_gdp,y = logit_meanaccess,group =1),color ="red" ) +theme_bw() +labs(x ="GDP Per Capita (averaged over the years)",y ="",subtitle ="Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",title ="Relationship between Observed LOGIT avg. Water Access and avg. GDP" ) +scale_y_continuous(limits =c(0, 6))sim_data |>ggplot(aes(x = mean_gdp,y = sim_logit_meanaccess ) ) +geom_point() +geom_smooth(method ="lm",aes(x = mean_gdp,y = logit_meanaccess,group =1),color ="red" ) +theme_bw() +labs(x ="GDP Per Capita (averaged over the years)",y ="",subtitle ="Simulated Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",title ="Relationship between Simulated LOGIT avg. Water Access and avg. GDP" ) +scale_y_continuous(limits =c(0, 6))sim_data |>ggplot(aes(x = sim_logit_meanaccess,y = logit_meanaccess), ) +geom_point() +scale_x_continuous(limits =c(0, 10)) +scale_y_continuous(limits =c(0, 10)) +geom_abline(slope =1, intercept =0, color ="red", linewidth =1) +theme_bw() +labs(x ="Simulated Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",y ="",subtitle ="Observed Logit of the Pop. Percentage with Basic Water Access (averaged over the years)",title ="Relationship between Simulated LOGIT avg. Water Access and Observed LOGIT avg. Water Access" )```### Generating Multiple Predictive ChecksInstead of creating just one dataset of simulated observations, we can generate 1000 simulated datasets. From here, we can regress the observed dataset against each one of the 1000 simulated datasets to generate multiple R^2 values, from where we can assess the distribution of the R^2 values to determine how well our original linear regression model describes the dataset (by determining how much variability seen in the observed logit water access is accounted for by our similated data based on the linear model).```{r}set.seed(7256)# Generating 1000 simulated sets of observations and putting them into one datasetnsims <-1000sim_data2 <-map_dfc(.x =1:nsims,.f =~tibble(sim =addnoise(logit_predict,sd = resid_sigma ) ) )# Cleaning the simulated datasetcolnames(sim_data2) <-colnames(sim_data2) |>str_replace(pattern ="\\.\\.\\.",replace ="_" )# Combining this simulated dataset with a dataset containing the original observed valuessim_data2 <- gdp_v_water_mean |>ungroup() |>filter(is.na(mean_gdp) ==FALSE,is.na(logit_meanaccess) ==FALSE ) |>select(logit_meanaccess) |>bind_cols(sim_data2)head(sim_data)# Regressing observed values to each set of simulated values to generate r-squared valuesreg_sim <-map(.x = sim_data2,.f =~lm(logit_meanaccess ~ .x,data = sim_data2) ) |>map(.f =~glance(.x)) |>map_dbl(.f =~ .x$r.squared)# Removing the r-squared value corresponding to logit_meanaccess ~ logitmeanaccessreg_sim <- reg_sim[names(reg_sim) !="logit_meanaccess"]reg_sim# Creating a histogramtibble(rsquared = reg_sim) |>ggplot(aes(x = rsquared)) +geom_histogram(bins =30,color ="darkorange",fill ="orange")```